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1.  Multiple  Regression 

Regression  is  a  method  for  modeling  a  set  of  response  variables  Yi  (1  <  t  <  q)  as 
functions  of  a  set  of  predictor  variables  Xj  (1  <  j  <  p)  based  on  matched  observations 
(training  data). 

yik,y2k,  ■■yqk,xik,x2k,--xpk  (o) 

Often  there  is  only  a  single  response  variable  (q  =  1).  Usually  the  goal  is  to  estimate 
the  conditional  expectation  of  each  Yi  given  a  set  of  values  for  the  predictor  variables 

( ’X\  >  X2t  •  •  *  Zp) 

Yi(xi, X2, . . .  ,xp)  =  E[Yi  \  Xi  =  xi,X2  =  X2,' •  •  ,Xp  =  zp]  (1  <  *  <  q)}  (1) 

as  the  predictor  variable  values  range  over  some  region  of  interest  in  Rp.  These  conditional 
expectation  estimates  are  then  used  as  best  guesses  for  the  true  underlying  response  values 
assuming  that  the  observed  responses  were  generated  from  a  noisy  process 

Yi  =  gi(Xi,X2,--  ,Xp)  +  ei  (1  <  i  <  q)  (2) 

where  the  <7,'  are  single  valued  functions  of  p  variables  and  £,•  is  a  random  variable  with  zero 
expectation.  The  conditional  expectations  ^(zj, za,  •  •  • , zp)  can  be  regarded  as  estimates 
for  the  gi(xi ,  z3,  •  •  • ,  zp)  (1  <  t  <  q). 

The  classical  linear  model  expresses  the  Yi  as  linear  functions  of  the  predictor  variables 

p  _ 

Yi(Xi  •  •  •  Zp)  =  ai0  +  ^2  OtijXj 

3=1 

where  the  values  of  the  o:,-y  are  chosen  to  be  those  for  which  the  expected  distant  between 
Yi  and  Yi  is  minimized.  Several  different  distance  measures  are  in  common  use,  but  the 
most  common  is  the  Euclidean 


L^on0--aip)  =  EY,x[Yi-Yi]2. 
The  resulting  estimates  are  termed  least-squares  estimates. 


(3) 
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Recently  Friedman  and  Stuetzle  (1981)  suggested  an  extension  to  the  basic  linear 
model  (termed  PPR  for  Projection  Pursuit  Regression).  It  has  the  form 


Mf 

XP )  =  fim(afrnx) 

m=l 


(4) 


with 


T 

<*imx 


=  E“i 


(ftx . 

*m*J 


(5) 


y=i 

and  the  /,-m  single  valued  (ridge)  functions  of  a  single  variable.  Instead  of  modeling  ptyh 
response  as  a  linear  combination  of  the  predictor  variables  (as  in  linear  regression),  PPR 
models  each  one  as  a  sum  of  functions  of  linear  combinations  of  the  predictor  variables. 
The  parameters  of  the  linear  combinations  ajm  as  well  as  the  functions  /,-m  are  chosen  to 
simultaneously  minimize  the  expected  distance  between  Yi  and  F,-.  Friedman  and  Steutzle 
(1981)  proposed  an  algorithm  for  approximately  minimising 


L*(afx  •  •  •  ,  fn  •  •  •  fiMi)  =  E[Yi  -  Fi]2, 

A 

with  Yi  given  by  (4).  They  also  proposed  a  forward  stagewise  procedure  for  choosing 
PPR  can  be  expected  to  perform  better  than  linear  regression  in  those  situations  where 
there  are  substantial  nonlinearities  in  the  dependence  of  the  responses  on  the  predictor 
variables,  especially  if  the  nonlinearities  are  approximated  reasonably  well  by  a  few  ridge 
functions  (functions  that  vary  in  only  one  direction  in  i2p).  PPR  approximations  are  dense 
in  the  sense  that  any  function  of  p  variables  can  be  arbitrarily  closely  approximated  by 
ridge  function  expansions  (4)  for  large  enough  Mi  (Diaconis  and  Shashahani,  1984), 

PPR  was  originally  intended  for  (and  presented  in  the  context  of)  a  single  response 
variable  (g  =  1).  For  the  case  of  several  responses  (g  >  1)  PPR  models  (4)  can  be 
cumbersome  due  to  the  large  number  of  functions  and  linear  combinations  involved.  Also, 
the  variance  associated  with  estimating  this  many  functions  and  parameters  can  be  high 
for  all  but  very  large  samples,  due  to  overfitting. 

This  paper  presents  a  generalization  of  the  PPR  model  suitable  for  multiple  response 
regression.  This  generalization  (termed  SMART  for  Smooth  Multiple  Additive  Regression 
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Technique)  takes  the  form 


M 

Vi*!  •••*,)=  2  (!<'<»)•  (6) 

m=l 

with  Y i  =  £Yi,  Efm  =  0,  i?/^,  =  1  and  a^am  =  1.  Here  each  response  variable 
is  modeled  as  a  linear  combination  of  predictor  functions  fm  (1  <  m  <  M).  Each  of 
these  predictor  functions  is  a  (smooth  but  otherwise  unrestricted)  ridge  function  in  the 
predictor  variables,  i.e.  a  function  of  a  linear  combination  of  the  predictors.  An  algorithm 
is  presented  for  minimizing 

1=1 

with  respect  to  the  response  linear  combinations  /3^  =  (/?lm  •  •  •  f3qm),  the  predictor  linear 
combinations  =  (cnm  •  •  •  apm),  and  the  functions  /»( 1<  m  <  M)  with  Y{  given  by 
(6).  The  (non- negative)  response  weights  W{  (1  <  t  <  q),  specified  by  the  user,  permit 
some  flexibility  in  the  specification  of  a  loss  metric  (see  below).  (It  is  possible  to  specify  a 
more  general  quadratic  form  for  the  response  loss  metric  than  (7);  this  would  be  represented 
by  a  general  positive  definite  symmetric  matrix.) 

SMART  models  (6)  contain  PPR  models  (4)  as  a  special  case.  They  often  can  be  much 
more  parsimonious  however,  by  capturing  the  dependence  of  the  response  variables  with 
many  fewer  functions.  This  is  especially  true  when  there  is  a  high  degree  of  association 
among  the  responses.  For  the  case  of  a  single  response  (q  =  1)  both  models  have  the  same 
form.  They  differ,  however  in  that  SMART  chooses  estimates  that  minimize  (7)  whereas 
PPR  chooses  the  (1  <  m  <  M)  in  a  forward  stagewise  manner.  This  can  result  in 
considerably  different  models,  especially  when  there  are  strong  associations  among  the 
predictor  variables. 

Expected  values  are  computed  from  the  data  as 

N  If 

E\z\  =  ^2  v>kZk/  ^2  Wk  (8) 

*=i  *= i 

where  Z  is  considered  to  be  a  random  variable  and  Zk  (1  <  k  <  N)  are  its  realized  values 
comprising  the  data.  The  observation  weights  Wk ,  specified  by  the  user,  can  be  employed  to 
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assign  differing  mass  to  different  observations.  They  can  also  be  used  to  implement  iterative 
reweighting  schemes  for  robust ificat ion  or  approximate  maximum  likelihood  fitting. 

As  with  any  distance  measure,  the  squared  error  loss  criterion  (7)  is  sensitive  to  the 
relative  scales  of  the  response  variables  F*.  The  influence  of  each  response  is  in  proportion 
to  its  variance  t/ar(F,-).  If  the  goal  is  to  give  each  response  equal  importance  in  the  loss 
function  (7),  then  one  can  set  W{  =  1/  var  (F,)  or  rescale  the  response  variables  to  have 
equal  variance. 

2.  Classification 


Classification  is  closely  related  to  regression.  Here  a  single  response  variable  Y  assumes 
several  categorical  (unorderable)  values  (c1?  c2,  •  •  • ,  cq).  The  loss  criterion  is  usually  taken 
to  be  the  misclassification  risk 


R  =  E[  min  £  /,y  p{i  \XuX2r--  Xp)\ 


(9) 


where  /,-y  is  the  (user  specified)  loss  for  predicting  Y  =  cy  when  its  true  value  is  ct-  (/,-,•  =  0). 
The  conditional  probability  p(i  |  xx  •  •  •  xp)  is  the  probability  that  F  =  c,-  given  a  particular 
set  of  values  for  the  predictor  variables  xi  •  •  •  xp.  The  sum  in  (9)  is  simply  the  loss  for 
predicting  Y  =  cy  given  a  set  of  predictor  values.  The  minimization  operation  provides  a 
decision  rule  that  minimizes  this  loss  at  each  set  of  predictor  values.  The  risk  is  then  the 
expected  or  average  loss  using  this  optimal  decision  rule.  The  art  of  classification  is  to  find 
estimates  of  the  conditional  probabilities  that  minimize  the  misclassification  risk. 

Defining  category  (class)  indicator  variables  for  each  observation  k  as 

h..  =  r  1  if  yfe  =  C,  l<k<N 

*  (  0  otherwise  1  <  i  <  q 


one  has 


V(i  \x1--  xp)  =  ^E{Hi  |  x,  •  ■  •  x„| 
St 


(10) 


N 


k=l 


with  jrf  the  unconditional  (prior)  probability  that  F  =  ct-  (Hi  =  1),  s,-  =  ^iofc<5(yfc,cf), 
and 

v 

S  =  ^  ]  Sj.  Here  8  is  the  Kronecker  delta  function 


f=i 


1  if  a  =  6 


5(a,  6)  =  (  *  ' 

1  0  otherwise. 
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Substituting  (10)  into  (9)  one  has 


«  =  E  17  1  Xl '  ■  ■  *pH  -  (11) 

From  this  one  sees  that  the  optimal  decision  rule  for  a  given  set  of  predictor  values  x\  •  •  •  xp 
is  to  assign  Y  =  cj»  where  J*  is  the  integer  value  (1  <  J*  <  q )  that  minimizes  the  sum  in 
(11). 

When  the  prior  probabilities  it,-  (1  <  *  <  q )  are  unknown,  they  can  be  estimated  from 
the  data  as  jt,-  =  S{/S.  Often  the  losses  /,y  are  taken  to  be  simply  /,y  =  1  —  When 

both  of  these  situations  occur  the  misclassification  risk  reduces  to  simply,  the  misclassifi- 
cation  probability. 

SMART  models  the  condition  expectations  (10,  11)  in  the  form  given  by  (6).  Ideally 
the  parameter  and  function  estimates  should  be  chosen  using  the  misclassification  risk  R 
(11)  as  a  distance  measure..  However,  as  discussed  in  Breiman,  Friedman,  Olshen  and 
Stone  (1983)  (see  also  Efron,  1978),  this  can  lead  to  difficulties  due  to  the  non-convexity 
of  R  (11).  A  good  surrogate  is  the  Euclidean  distance  L 2  (7)  with 


3.  Optimization  of  least  squares  criterion  for  SMART  models 


This  section  discusses  the  minimization  of  L 2  (6,  7)  simultaneously  with  respect  to 
ocjm  (1  <  j  <  p),  Pim  (1  <  *  <  q)  and  the  functions  fm  (1  <  m  <  M)  for  a  given  number 
of  terms  M.  (A  method  for  choosing  M  is  discussed  in  the  next  section.)  An  alternating 
optimization  strategy  is  used.  The  parameters  are  grouped  such  that  the  solution  for 
those  in  each  group  is  straightforward  given  fixed  values  for  those  outside  the  group.  A 
solution  is  obtained  for  the  variables  in  a  group  and  these  solution  values  replace  their 
current  values.  Attention  is  then  focused  on  the  next  group  and  this  process  repeated  for 
its  parameters.  After  solutions  have  been  obtained  for  all  groups  of  parameters,  another 
pass  is  made  over  the  groups  obtaining  new  solution  values,  given  the  new  values  for  the 
parameters  outside  each  group  that  were  obtained  in  the  previous  pass.  These  passes  are 
repeated  until  the  loss  criterion  L 2  (7)  fails  to  decrease  on  two  consecutive  passes.  Usually 
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a  threshold  e  is  set  at  a  small  value  and  if  improvement  on  two  consecutive  passes  is  W? 
than  e,  iterations  are  stopped  and  the  parameter  values  at  that  point  taken  as  the  solution. 
Since  at  each  step  in  this  process  L 2  is  made  smaller  through  a  partial  minimization,  and 
L2  >  0,  the  alternating  optimization  must  converge  (provided  e  is  large  compared  to  the 
numerical  accuracy  of  the  computer’s  arithmetic).  However,  there  is  no  guarantee  that  the 
solution  is  the  global  minimum  of  L 2.  It  may  be  a  local  minimum.  Strategy  for  dealing 
with  this  problem  in  the  context  of  SMART  modeling  is  discussed  in  the-next  section. 

The  parameter  grouping  used  in  the  SMART  algorithm  is  hierarchical.  The  first  level 
grouping  is  by  term.  The  parameters  ajm  (1  <  j  <  p),  pim  (1  <  t  <  q)  and  the  function 
fm  (for  fixed  m)  form  each  group.  There  are  obviously  M  such  groups.  At  the  second 
level  the  parameters  of  each  term  are  divided  into  three  groups:  the  arym  (1  <  j  <  p)  form 
the  first  (sub)  grouping,  the  /?,m  (1  <  t  <  q)  form  the  second  and  the  function  fm  forms 
the  third. 

Consider  a  particular  term,  k  (1  <  k  <  M).  The  loss  criterion  (6,  7)  can  be  reexpressed 
as 

h  =  4*°  =  X;H',  E{Ri(k)  -  feA(aJX)]2  (13) 

1=1 

with 

RiW=Yi-7l-'£pimf„(alX)  (14) 

m-jtk 

Equation  13  isolates  the  kth  term’s  contribution  to  the  criterion.  Following  the  alter¬ 
nating  optimization  strategy  we  minimize  L%  with  respect  to  the  parameters  of  the 

kth  term.  These  parameter  values  are  then  used  to  help  define  Ri(k'),k'  ±  k,  to  obtain 
new  solutions  for  the  parameters  of  other  terms.  Repeated  passes  are  made  over  all  the 
terms  until  convergence  (L%  stops  decreasing-see  above). 

We  now  focus  on  obtaining  solutions  for  the  parameters  of  the  kth  term  given  R^k) 
(14).  The  solutions  for  the  /?,&  (given  fk  and  atj£)  are  straightforward 


g. 

P,t 


(1  <  >  <  ?) 


(Remember  that  ^[R,(fc)]  =  E[fk{aJ.X)\  =  0). 


(15) 
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The  solution  for  the  function  fk  (given  /?£  and  ak)  is  almost  as  easily  obtained. 
Reexpressing  L ^  (13)  as 

4k>  =  E«lx  Wi(R,w  -  ft*/*)2  I  clx] ,  (16) 

1=1 

we  see  that  it  is  minimized  if  fk  is  chosen  to  minimize  the  conditional  expectation  in  16 
for  each  value  of  akx.  This  is  accomplished  by 


/*(<£*)  =  I  (17) 

*=1  1=1 

Since  we  require  Efk  =  0  and  Efj*  =  1,  we  standardize  /£,  rendering  the  denominator  in 
(17)  irrelevant. 

It  remains  to  find  a  solution  that  minimizes  L ^  (13)  with  respect  to  aj.  =  (ai*, 
dak,  •  •  •  apfc)  given  values  for  (3ik  (1  <  *  <  q )  and  a  (fixed)  function  fk.  Unlike  the  other 
parameters  (/?*  and  fk),  ctk  does  not  enter  in  a  purely  quadratic  way  into  the  distance 
criterion.  Therefore,  solutions  may  not  be  unique,  and  they  cannot  be  obtained  in  a  single 
step.  An  iterative  numerical  optimization  must  be  performed. 

The  loss  criterion  Lq  (6,  7,  13)  can  be  expressed  in  the  generic  form 


£*(<*»)  =  X>i£[ji(afc)]2 

t=l 


(18) 


with 

ffi(ak )  =  (-R* (k)  ~  Pikfk{a'k^'f)  (1®) 

The  classical  numerical  optimization  technique  for  criteria  of  the  form  (18)  is  the  Gauss- 
Newton  method  (see  Gill,  Murray  and  Wright,  1981,  Section  4.7).  Let  = 

’  >  apj^)  be  a  trial  se*  °f  values  at  some  point  during  the  optimization.  The  Gauss- 
Newton  estimate  for  the  solution  a %  (the  next  set  of  trial  values  in  the  iterative  process)  is 
ock  —  where  the  vector  AT  is  the  solution  to  the  set  of  simultaneous  equations 


E^K 


dj9i 

dock 


-E^K 

*=l 


dgi 

dak 


(20) 
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The  function  g{  and  the  vector  of  partial  derivatives  are  evaluated  at  a[°\  From  (19)  one 
has 

=  -Wh(a\°)T  X)X  (21) 

where  f(z)  =  df/dz.  After  solving  (20)  for  A,  a*  replaces  and  the  process  can  be 
repeated  until  convergence  (L2  stops  decreasing). 

It  is  possible  that  a  Gauss-Newton  step  fails  to  decrease  L2  (Z2(a^+  A)  >  X2(a^)). 
In  this  case  the  step  is  cut  in  half  (a*  =  +  A/2).  If  this  new  step  still  results  in  an 

increase  in  L2,  the  step  is  cut  again  (a*  =  aj^  +  A/4).  This  repeated  cutting  of  the  step 
is  continued  until  L2  decreases.  Since  the  matrix  on  the  left- hand-side  of  (20)  is  positive 
definite,  A  =  A/  |  A  |  is  a  valid  descent  direction  and  at  some  point  the  step  cutting  must 
give  rise  to  a  decrease  in  L2  (unless  aj.0^  represents  a  minimum  of  L2). 

The  nonparametric  estimates  for  the  the  functions  fk(a^x)  are  stored  as  an  ordinate 
and  abscissa  value  for  each  observation.  The  derivative  estimates  /^(a^®)  are  similarily 
stored  (see  below).  These  values  are  obtained  when  /^(a^a:)  is  evaluated  (17).  When 
a4°)T  is  changed  to  a*  (via  Gauss-Newton  update),  an  interpolation  scheme  must  be 
employed  to  obtain  values  for  /^(a^x)  from  fk{a^Tx).  This  interpolation  is  almost  as 
expensive  as  obtaining  the  optimal  function  for  the  new  argument  a^x.  We,  therefore,  do 
not  iterate  the  Gauss-Newton  stepping  until  convergence  for  a  given  function,  but  rather 
take  only  a  single  step.  A  new  (optimal)  function  /fc[(arj^r  +  Ar)x]  (17)  is  evaluated,  and 
the  next  Gauss-Newton  step  (19-21)  is  made  based  on  this  new  function.  Step  cutting,  as 
described  above,  is  employed  for  bad  steps.  In  this  way  both  the  function  and  the  predictor 
linear  combination  for  the  k  —  th  term  are  simultaneously  optimized  by  the  Gauss-Newton 
iteration  procedure. 

The  expected  values  i?[-]  are  easily  evaluated  via  (8).  The  conditional  expectation 
estimates  (17)  for  evaluation  of  the  optimal  functions  are  more  difficult.  The  method  used 
here  is  described  in  detail  in  Friedman  (1984a).  The  derivative  estimates  (21)  are  made 
by  taking  first  differences  of  the  function  estimates 


m ar m) .  (2<i<jv-i) 

otjfeWi  ~  z/-i) 


(22) 
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where  the  zj  are  labeled  in  increasing  order  of  a£z.  Endpoints  (/  =  1  and  l  =  N) 
are  handled  by  simply  copying  the  values  of  their  nearest  neighbors.  Such  estimates  can 
become  unstable  if  the  denominator  becomes  too  small.  This  can  be  avoided  by  pooling 
observations  for  which 


|a£(x,-z,,)|  <el  (1  </,/'<  AT)  (23) 

into  a  single  observation  for  the  purpose  of  derivative  calculation.  Here  I  is  the  semi- 
interquartile  range  of  a%x  and  e  is  a  small  number  (e  ~  0.05).  This  pooling  can  be  done 
rapidly  by  using  a  method  similar  to  the  pooled- adjacent-violators  algorithm  for  isotone 
regression  (Kruskal,  1964). 

4.  Modeling  Strategy 

The  principal  task  of  the  user  is  to  choose  Af  (6)  the  number  of  predictive  terms  com¬ 
prising  the  model.  Increasing  the  number  of  terms  decreases  the  bias  (model  specification 
error)  at  the  expense  of  increasing  the  variance  of  the  (model  and  parameter)  estimates. 
Since  the  expected  squared  error,  ESE,  is  the  sum  of  these  two  effects  -  ESE  =  (bias)2  + 
variance,  there  is  an  optimal  value  for  Af .  Sample  reuse  techniques  can  be  used  to  estimate 
these  effects  -  ESE  through  cross-validation  (Stone,  1977)  and  (Geisser,  1975),  and  vari¬ 
ance  through  bootstrapping  (Efron,  1983).  It  is  possible  to  implement  these  procedures  in 
conjunction  with  SMART  with  the  aim  of  estimating  an  optimal  value  for  Af  as  well  as 
confidence  intervals  for  estimates. 

Since  the  variance  tends  to  increase  more  or  less  linearly  with  increasing  Af  while 
the  (bias)2  tends  to  drop  rapidly  for  small  (increasing)  Af,  leveling  off  to  a  slow  decrease 
for  larger  Af,  a  good  estimate  for  the  optimal  Af  value  can  usually  be  made  by  simply 
inspecting  L 2  vs.  Af  for  various  values  of  Af.  That  point  at  which  a  unit  decrease  in  Af 
leads  to  a  relatively  large  increase  in  L?  (compared  to  that  for  close-by  larger  Af  values)  is 
often  a  good  choice.  Since  the  ESE  tends  to  vary  slowly  as  a  function  of  Af  in  the  region 
near  the  optimal  Af  value  (especially  on  the  side  of  increasing  Af),  the  choice  is  not  critical 
provided  it  is  not  too  small. 

For  a  given  value  of  Af,  solutions  (minimizing  L 2)  may  not  be  unique.  Sometimes 


9 


there  are  local  minima  that  can  trap  the  SMART  algorithm  thereby  masking  a  better 
global  minimum.  Such  local  minima  represent  solutions  that  are  relevant  to  larger  (higher 
M)  models.  Solutions  are  not  necessarily  found  in  optimal  order  as  M  is  increased.  This 
suggests  a  backwards  stepwise  model  selection  procedure. 

The  strategy  is  to  start  with  a  relatively  large  value  of  M  (say  M  =  ML)  and  find 
all  models  of  size  Ml  and  less.  That  is,  solutions  that  minimize  L2  are  found  for  M  = 
Ml,  Ml  —  1,  Ml  —  2,  • ,  1  in  order  of  decreasing  M.  The  starting  parameter  values 

for  the  numerical  search  in  each  Af-term  model  are  the  solution  values  for  the  M  most 
important  (out  of  M  +  1)  terms  of  the  previous  model.  Term  importance  is  measured  as 

4n  =  yi  Wi  |  Pim  |  (l  <  m  <  M)  (24) 

*=i 

normalized  so  that  the  most  important  term  has  unit  importance. 

(Note  that  the  variance  of  all  /mis  one.)  The  starting  point  for  the  minimization  of  the 
largest  model,  M  =  Ml,  is  given  by  an  Ml  term  stagewise  model  (Friedman  and  Stuetzle, 
1981). 

The  sequence  of  solutions  generated  in  this  manner  is  then  examined  by  the  user  and 
a  final  model  is  chosen  according  to  the  guidelines  above. 


5.  Relative  Importance  of  Predictor  Variables 


It  is  often  useful  to  have  an  idea  of  the  relative  importance  of  each  predictor  variable  to 
the  final  model.  For  (single  response)  linear  models  an  often  used  measure  is  the  absolute 
value  of  the  corresponding  regression  coefficient  ay  times  a  scale  measure  of  the  predictor 
variable  cry,  Jy  =  try  |  ay  |,  (1  <  j  <  p ).  A  corresponding  relative  importance  measure  for 
(multiple  response)  nonlinear  models  would  be 

Q 


dYj 
dXi 


Ij  =  E 


(i  < ;  <  p) 


with  Y,  =  E[Yi  |  Xi  •  •  •  xp].  For  SMART  models  (6)  this  becomes 

I,  =  YiW,E\  jr  |  (1  <  j  <  p) 


(25) 


m=  1 
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where  f'm(z)  =  dfm/dz  (22).  In  the  case  of  only  one  term,  M  =  1,  (25)  is  equivalent 
to  Ij  =  Oj  |  a.j  |.  It  is  important  to  keep  in  mind  that  the  same  care  is  required  in 
interpreting  (25)  as  in  the  corresponding  interpretation  of  regression  coefficients  in  linear 
models,  especially  in  the  presence  of  high  collinearity  among  the  predictor  variables. 

5.  Examples 

In  this  section  we  show  and  discuss  the  results  of  applying  the  procedure  described  in 
the  previous  sections  to  several  data  sets.  The  purpose  here  is  to  illustrate  the  functioning 
of  the  procedure  and  to  provide  a  little  insight  into  the  interpretation  of  results.  They  are 
not  intended  as  definitive  or  complete  analyses  of  these  data. 

The  first  example  illustrates  the  use  of  the  algorithm  in  an  approximation  rather 
than  an  estimation  mode.  The  purpose  is  to  approximate  a  single  function  ( q  =  1)  of  three 
variables  by  a  ridge  function  expansion  (4).  Thus,  there  is  no  noise  in  the  system,  e  =  0  (2). 
The  data  consist  of  200  randomly  generated  triangles  in  the  plane.  The  response  function 
was  taken  to  be  the  ratio  of  the  area  of  the  triangle  to  the  area  of  the  circumscribed  circle. 
The  predictor  variables  are  the  lengths  of  the  three  sides  of  the  triangle,  ordered  so  that 
the  first  variable  correspond  to  the  smallest  side,  the  second  to  the  middle,  and  the  third 
predictor  to  the  largest  side.  The  true  functional  form  is 

y=g(xux2,x3) 

_4[(xi  +  x2  +  3*3)  (3*2  +  Xz  —  xx)(xi  +  x3  —  gg)(g  1  +  x2  —  x3)]f  ,  » 

*(x  xx2x3)2  _ 

which  is  of  course  symmetric  in  the  three  variables.  This  complicated  expression  does 
not  have  an  exact  ridge  function  expansion.  The  purpose  of  the  exercise  is  to  see  if  the 
SMART  algorithm  can  find  a  parsimonious  ridge  function  expansion  that  provides  a  good 
approximation. 

Table  1  shows  the  fraction  of  unexplained  variance  e2  as  a  function  of  the  number 
of  terms  in  the  model  M.  Using  the  guidelines  of  Section  4  the  M  =  4  term  model  was 
chosen.  Table  2  shows  the  solution  linear  combinations  for  the  four  terms  as  well  as  the 
corresponding  importance  of  each  term  (24).  Table  3  presents  the  relative  importance  of 
each  predictor  variable  (side  length)  (25)  to  the  model.  Figures  la  —  Id  show  the  four 
predictor  functions  fm  (a£x)  (1  <  m  <  4)  corresponding  to  each  term.  The  functions 
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are  displayed  as  3catterplots  of  linear  combination  value  (abscissa)  versus  function  value 
(ordinate)  for  the  200  observations. 

Even  though  the  true  function  (26)  is  quite  complicated,  the  algorithm  was  able  to 
find  a  four  term  ridge  function  expansion  that  accounts  for  99.88%  of  its  variance.  The 
two  most  important  terms  involve  linear  combinations  that  are  close  to  those  appearing 
in  the  numerator  in  (26).  The  third  linear  combination  involves  X\  and  X3  while  the  last 
involves  X2  and  X3.  The  solution  function  corresponding  to  the  first  term  is  monotone 
and  nearly  linear;  the  next  two  are  highly  non  monotone  and  the  fourth  is  nearly  monotone 
but  highly  nonlinear.  All  three  variables  are  relatively  important  to  the  model  with  Xi 
and  X3  being  most  important.  Although  the  solution  ridge  function  expansion  is  very 
accurate,  it  is  unlikely  that  one  would  be  able  to  guess  the  correct  functional  form  (26) 
from  the  four  linear  combinations  (Table  2)  and  the  four  predictor  functions  (Figs,  la-ld). 

The  second  example,  although  involving  actual  data,  is  still  somewhat  contrived  to 
illustrate  the  functioning  of  the  algorithm.  It  consists  of  various  physico-chemical  proper¬ 
ties  of  the  52  chemical  elements  ranging  from  Lithium  (Li)  to  Xenon  (Xe)  in  the  periodic 
table  of  elements.  Four  of  these  properties  form  the  responses  (5  =  4);  Yi  =  first  ionization 
energy,  Y2  =  electronegativity,  Y3  =  covalent  radius,  and  K*  =  density  (Lewi,  1982).  The 
two  predictor  variables  (p  =  2)  are  locators  of  the  element  in  the  periodic  table;  X\  = 
atomic  number,  and  X2  =  atomic  group  number.  The  goal  is  to  see  how_accurately  one 
can  model  these  physico-chemical  properties  by  periodic  table  location,  what  form  this 
model  might  take,  and  whether  atomic  number  or  group  is  more  important  in  determining 
the  dependencies. 

To  aid  in  interpretation  both  the  four  response  and  two  predictor  variables  were 
standardized  to  have  zero  means  and  unit  variances  as  calculated  over  the  52  observations 
(elements).  The  response  weights  W{  (1  <  *  <  4)  (7)  were  all  set  to  unity.  The  accuracy 
of  the  fitted  model  is  expressed  in  terms  of  fraction  of  variance  unexplained,  defined  as 

«’  =  L*lY,wiE\Yi-Yi\*  (27) 

1=1 

with  <j  =  4,  L2  given  by  (7),  and  F,-  =  EY{. 
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Table  4  gives  the  fraction  of  unexplained  variance  e2  (27)  as  a  function  of  the  number 
of  terms  in  the  model.  Again,  the  guidelines  of  section  4  suggest  a  four  term  (M  =  4) 
model.  Table  5  shows  the  response  linear  combinations  /?,m  (1  <  *  <  4),  the  predictor 
linear  combinations  ajm  (1  <  j  <  2)  as  well  as  term  importance  Im  (24)  for  this  four  term 
model  (1  <  m  <  4).  Table  6  shows  the  fraction  of  unexplained  variance  for  each  response 
separately  for  this  model.  The  relative  importance  of  each  predictor  variable  //  (25)  was 
atomic  number  Ii  =  1.00,  group  / 2  =  0.38.  The  four  predictor  functions  corresponding  to 
the  four  terms  (Table  5)  are  shown  in  Figures  2a  —  2d. 

Since  the  cardinality  of  this  data  set  is  rather  small  ( N  =  52)  and  the  resulting  model 
rather  complex,  one  might  suspect  the  presence  of  considerable  overfitting.  Table  6  shows 
that  this  is  indeed  the  case.  The  last  column  of  this  table  shows  a  cross-validated  estimate 
of  .the  fraction  of  unexplained  variance  for  each  response  separately.  This  cross-validated 
estimate  is  obtained  by  removing  one  observation  at  a  time,  estimating  a  four  term  model 
on  the  remaining  ( N  =  51)  data,  and  computing  the  squared  residual  for  the  left-out 
observation  using  this  model.  The  last  column  of  Table  6  was  obtained  by  averaging  these 
squared  residuals  over  all  ( N  =  52)  observations  left  out  one  at  a  time.  Although  these 
cross-validated  results  still  show  considerable  explanatory  power  in  the  model,  we  see  that 
the  simple  resubstitution  estimate  of  the  squared-error  loss  is  about  3|  times  too  optimistic 
on  the  average  in  this  case. 

The  first  two  predictive  linear  combinations  (Table  5)  are  dominated  by  X\ ,  atomic 
number.  The  corresponding  functions  (Figs.  2a,  2b)  are  highly  nonlinear;  the  first  has  a 
periodic  saw-toothed  appearance  with  steeply  rising  slope  and  the  second  is  highly  oscil¬ 
latory.  The  third  function  involves  more  of  X%,  group  number,  and  is  also  very  nonlinear. 
The  fourth  function  is  dominated  by  X%  and  has  a  gentle  monotonic  dependence. 

On  the  basis  of  this  analysis  one  would  conclude  that  these  physico-chemical  properties 
do  depend  on  position  in  the  periodic  table,  but  in  a  highly  nonlinear  (periodic)  manner. 
Of  course,  this  is  already  well  known.  The  purpose  of  including  this  example  was  to  show 
that  the  SMART  algorithm  is  capable  of  modeling  such  severe  nonlinear  response  surfaces 
even  with  relatively  small  sample  size. 

The  final  example  is  a  classification  problem  involving  medical  data.  The  observations 
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consist  of  154  patients  with  chronic  hepatitis  (Efron  and  Gong,  1983).  The  purpose  of  this 
exercise  is  to  model  the  severity  of  the  disease  as  a  function  of  seven  clinical  measurements. 
These  measurements  include  the  age  and  sex  of  the  patient  as  well  as  the  blood  concen¬ 
trations  of  five  quantities  (Table  7).  The  response  is  binary  valued  indicating  whether  the 
patient  did  or  did  not  survive  the  illness.  In  the  training  sample  122  patients  survived 
(class  =  1),  while  32  did  not  (class  =  2).  Although  the  sample  size  ( N  =  154)  might  be 
regarded  as  moderate,  the  small  class  2  sample  size  dominates  the  statistical  aspects  of 
the  problem. 

SMART  classification  was  applied  to  these  data  with  the  purpose  of  constructing 
a  decision  rule  for  classifying  the  outcome  of  the  illness  based  on  the  predictor  variable 
values.  The  prior  probabilities  ir,-  (1  <  t  <  2)  (10,  11,  12)  were  estimated  to  be  the  sample 
proportions,  =  122/154,  jr2  =  32/154.  Since  a  conservative  diagnosis  is  usually  desired, 
the  loss  for  misclassifying  a  class  2  observation  as  class  1  (I2i )  was  set  to  four  times  that  for 
misclassifying  a  class  1  as  a  class  2  (/12);  specifically  l21  =  4.0  and  l12  =  1.0  (9, 11, 12).  The 
seven  predictor  variables  were  all  standardized  to  have  zero  expectation  and  unit  variance. 

Thble  8  shows  the  fraction  of  unexplained  variance  e2,  as  well  as  two  additional  quan¬ 
tities,  as  a  function,  of  the  numbers  of  terms  in  the  model.  These  additional  quantities 
are  two  different  estimates  of  the  misclassification  risk  associated  with  using  this  Af-term 
model  for  the  conditional  expectations  in  a  minimum  risk  decision  rule  (11).  The  first 
estimate  R\  (direct  resubstitution  risk  estimate)  is  obtained  by  classifying  each  training 
observation  k  (1  <  k  <  N)  using  the  minimum  loss  rule  (11) 


min 

i<y<? 


(E^ 

.=i  s* 


£|JJi  |  *!*•  ••*,.]} 


(28) 


and  then  computing  the  risk  by  averaging  the  loss  associated  with  the  resulting  misclassi- 
fications 

Ri=^2  WkS  ^2  lij;6{yk ,  c.)  /  wk.  (29) 

fc=i  »=i  s*  '  k=  i 

The  second  estimate  R2  (conditional  probability  risk  estimate)  is  the  value  of  R  (11) 
computed  by  substituting  the  conditional  expectation  estimates  of  this  (Af-term)  model 
directly  into  (11).  To  the  extent  that  the  conditional  expectation  (probability)  estimates 
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are  accurate  these  two  risk  estimates  should  have  similar  values.  However,  it  is  often 
possible  to  do  accurate  classification  in  the  presence  of  very  poor  probability  estimates. 
Comparing  the  values  of  Rx  and  R2  gives  some  indication  of  how  well  the  model  conditional 
expectation  estimates  are  approximating  the  true  underlying  probabilities.  If  Rx  is  much 
smaller  than  R%  (which  is  often  the  case)  then  the  probability  estimates  are  not  too  close. 

Using  the  guidelines  of  Section  4  a  three  term  (M  =  3)  model  was  chosen.  Table 
9  gives  the  solution  linear  combinations  a ^  and  the  importance  Im  (24)  for  each  term 
1  <  m  <  3.  Table  10  shows  the  relative  importance  of  each  predictor  variable  (25). 
Figures  3o  —  3c  show  the  three  predictor  functions  M  orj^z)  corresponding  to  each  model 
term  m  (1  <  m  <  3). 

The  resulting  model  misclassifies  24/122  ~  2096  of  the  survivors  (class  1)  and 

2/32  ~  696  of  the  nonsurvivors  (class  2).  The  goal  of  a  conservative  classification  rule  has 
been  achieved.  Since  the  sample  size  is  only  moderate  one  may  again  suspect  these  results, 
based  on  the  training  sample,  to  be  optimistic  estimates.  The  corresponding  cross-validated 
misclassification  results  are  33/122  ~  2696  and.  3/32  ~  996.  Although  indicating  some 
measure  of  overfitting,  these  cross-validated  results  indicate  that  a  substantial  dependence 
of  survivability  on  the  predictor  covariates  has  been  captured  by  the  model. 

The  predictor  functions  (Figs.  3a-3c)  are  substantially  nonlinear.  The  first  and  most 
important  term  is  mainly  a  function  of  variables  1  (sex)  and  7  (bilirubin).  For  values  of 
this  linear  combination  less  than  0.1  the  probability  of  survival  is  very  high.  For  values 
greater  than  0.1  this  probability  decreases  linearly  and  very  rapidly  with  increasing  value 
of  this  predictor  linear  combination. 

6.  Discussion 

The  examples  of  the  preceding  section  suggest  that  the  modeling  procedure  presented 
here  can  successfully  detect  and  model  highly  nonlinear  relationships  between  response 
and  predictor  variables.  Such  highly  non-linear  dependencies  are  not  characteristic  of  all 
situations.  In  these  cases  the  procedure  can  be  used  to  verify  their  non-presence.  This  is 
signified  by  the  need  for  only  a  single  ridge  function  ( M  =1)  with  nearly  linear  shape. 

SMART  models  are  not  the  only  nonlinear  generalizations  of  linear  regression  and 
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classification.  Other  generalizations  include  classification  and  regression  trees  (Breiman, 
Friedman,  Olshen  and  Stone,  1983),  ACE  (Breiman  and  Friedman,  1984)  and  other  gener¬ 
alized  additive  models  (Hastie  and  Tibshirani,  1984),  logisitic  regression  (Cox,  1970)  and 
nonlinear  link  functions  associated  with  generalized  linear  models  (McCullagh  and  Nelder, 
1983).  SMART  modeling  (6)  can  be  viewed  as  generalizations  of  some  of  these  (logistic 
regression,  generalized  linear  models)  in  the  sense  that  these  models  reduce  (or  nearly 
reduce)  to  special  cases  of  (6).  However,  several  other  of  the  above  listed  methods  rep¬ 
resent  different  generalizations  in  the  same  sense.  Only  classification  and  regression  trees 
(CART)  share  with  SMART  the  property  of  being  completely  nonparametric  in  that  any 
response  function  can  be  arbitrarily  well  approximated  given  a  large  enough  expansion. 
The  particular  form  chosen  for  SMART  models  was  motivated  by  the  desire  to  produce 
parsimonious  models  in  simple  situations  (nearly  linear  response  dependence  or  high  as¬ 
sociation  among  the  response  variables  )  along  with  the  ability  to  produce  more  complex 
models  for  those  situations  that  require  them.  — 

A  FORTRAN  program  (Friedman,  1984b)  implementing  SMART  regression  and  clas¬ 
sification  is  available  from  the  author. 
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Table  1 


Fraction  of  unexplained  variance  e2  as  a  func¬ 
tion  of  number  of  ridge  function  terms  M  for  tri¬ 
angle  example.  The  *  indicates  the  chosen  model. 


M 

6 

5 

4* 

3 

2 

1 


0.9  x  10“3 
1.0  x  10~3 
1.2  x  10"3 
3.9  x  10~3 
9.6  x  10~3 
3.8  x  IQ"2 


Table  2  . 


Predictor  linear  combination  and  relative 
term  importance  of  four  term  model  for  triangle 
example. 


Term  Importance 

1  1.00 

2  0.21 

3  0.16 

4  0.13 


al 

a  2 

a  3 

0.542 

0.502 

-0.674 

-0.506 

-0.689 

0.520 

0.385 

0.065 

-0.925 

0.003 

-0.674 

0.739 
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Table  3 


Relative  predictor  variable  importance  for  tri¬ 
angle  example. 

Variable  12  3 

Importance  0.93  0.68  1.00 

Table  4 

Fraction  of  unexplained  variance  e2  (27)  as  a 
function  of  number  of  ridge  function  terms  M  for 
the  atomic  element  example.  The  *  indicates  the 
chosen  model. 

M 
6 
5 

4* 

3 
2 
1 


.036 

.047 

.058 

.180 

.188 

.413 
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Table  5 


Linear  combinations  /3^,  and  term  importance  Im  of  the 
four  term  model  for  atomic  element  example 


Term 

(m) 

Im 

Am 

Am 

1 

1.00 

-0.43 

-0.40 

2 

0.53 

-0.02 

0.03 

3 

0.51 

0.08 

0.07 

4 

0.49 

0.17 

0.24 

Am 

Am 

®im 

®2m 

0.39 

0.29 

.98 

-0.18 

-0.33 

0.43 

.93 

-0.36 

0.39 

-0.24 

°o 

-0.54 

-0.11 

0.21 

.39 

0.92 

Table  6 


Fraction  of  unexplained  variance  for  each  response  variable 
e?  (1  <  »  <  4)  for  the  four  term  model.  Cross- validated  results 
e?(cv)  are  also  shown. 


♦ 

l 

Response  variable 

4 

ef(cv) 

1 

first  ionization  energy 

.094 

.20 

2 

electonegativity 

.054 

.16 

3 

covalent  radius 

.046 

.27 

4 

density 

.038 

.18 

Table  7 


Predictor  variables  X:-  (1  <  j  <  7)  used  in  hepatitis  example. 


Variable 

number 

1 

2 

3 

4 

5 

6 
7 


Variable 

name 

sex 

albumin 

proteim 

age 

SGOT 

alkphos 

bilirubin 


Table  8 

fraction  of  unexplained  variance  e2,  direct  resubstitution  risk 
estimate  Ri ,  and  conditional  probability  risk  estimate  Rq  as  a  func¬ 
tion  of  number  of  ridge  function  terms  M  for  hepatitis  classification 
example.  The  *  indicates  the  chosen  model. 


M 

e2 

Ri 

R 2 

4 

.47 

.16 

.31 

3* 

.49 

.21 

.33 

2 

.57 

.28 

.37 

1 

.60 

.24 

.30 
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Table  9 


Predictor  linear  combinations  ar^  and  relative  term  impor¬ 
tance  Im  of  three  term  model  for  hepatitis  example. 


Term 

Im 

®lm 

Q2m 

m 

®4m 

®5m 

®7m 

(m) 

1 

1.00 

-.67 

-.31 

.03 

.28 

-.16 

.19 

.55 

2 

.65 

-.09 

-.68 

-.36 

-.11 

.03 

.27 

-.55 

3 

.48 

-.05 

-.15 

.83 

-.17 

-.10 

00 

• 

.13 

Table  10 


Relative  predictor  variable  importance  for  hepatitis  example. 


Variable  1  2  3  4  5  6  7 

Importance  .74  1.0  .68  .45  .21  .44  .97 
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Figure  Captions 


Figure  la 
Figure  lb. 
Figure  lc. 
Figure  Id. 
Figure  2  a. 
Figure  2b. 
Figure  2c. 
Figure  2d: 
Figure  3  a. 
Figure  3b. 
Figure  3c. 


Triangle  data:  Term  1  predictor  function 
Triangle  data:  Term  2  predictor  function 
Triangle  data:  Term  3  predictor  function 
Triangle  data:  Term  4  predictor  function 
Periodic  table  data:  Term  1  predictor  function 
Periodic  table  data:  Term  2  predictor  function 
Periodic  table  data:  Term  4  predictor  function 
Periodic  table  data:  Term  4  predictor  function 
Chronic  hepatitis  data:  Term  1  predictor  function 
Chronic  hepatitis  data:  Term  2  predictor  function 
Chronic  hepatitis  data:  Term  3  predictor  function 
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Figure  la 


Figure  lb. 
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Figure  Id. 
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Figure  2a. 
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Figure  2c 


Figure  2d. 
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Figure  5a. 


Figure  3b. 
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Figure  5c. 
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